What Makes a Hollywood Movie a Hit or a Flop?

Final Project
Data Science 1 with R (STAT 301-1)

Author

Celena Kim

Published

December 8, 2023

Introduction

As an avid movie lover, I have always been curious about why some Hollywood movies become critically acclaimed blockbusters while others fade into the background. Beyond ticket sales alone, I am interested in exploring the interplay between the broader set of variables that contribute to a film ultimately being a hit or a flop. Specifically, I define a movie’s success in terms of critic and audience ratings, opening weekend revenue, gross (domestic, foreign, and worldwide), budget and budget recovery, and Oscar wins. I am also curious whether the time of year has an impact on a movie’s success, and whether there is a particular season in which the most successful movies are released. By focusing on these variables, I hope to address my research question by finding patterns and correlations at levels ranging from univariate to multivariate. I am interested in whether certain variables affect one another and how they work together to contribute to a movie’s overall success. To carry out this analysis, I use a data set from Kaggle called “Hollywood Hits and Flops (2007 - 2023)”, described in the next section.

Data Overview and Quality

The data source that I chose for my project is Hollywood Hits and Flops (2007 - 2023) by David McCandless, found on Kaggle. This data set contains information about 1,967 Hollywood movies from 2007-2022, such as each movie’s Rotten Tomatoes score, opening weekend revenue, genre, Oscar wins, and budget recovery percentage, all of which are variables that contribute to a movie’s overall success. In its raw form, the data consists of 16 separate CSV files, one for each year from 2007-2022. In order to conduct an EDA on all the files at once, I decided to combine the 16 files. However, while the files share mostly the same variables/columns, there are some notable differences between them, so they do not line up perfectly. Moreover, many variable conversions and additions were needed in order to carry out my planned explorations. As a result, an extensive joining and clean-up process was required to arrive at my final data set, “movie_data”. A more detailed account of this process can be found in the section of this report titled “Appendix: technical info”.

“movie_data” is a 354.1 KB CSV file of 1,967 observations and 33 variables of character, double, integer, logical, and date types. For my analysis, I plan to explore this data set in its entirety by splitting its variables into 6 main categories: “ratings”, “opening weekend success”, “gross revenue success”, “budget and budget recovery”, “Oscar wins”, and “season of release”. Additionally, I plan to use the categorical variables of “genre”, “script type”, and “distributor” within these analyses to explore how distributions vary across these categories.

In terms of missingness, there are a substantial number of missing values for the variables “primary genre”, “oscar award detail”, “distributor company”, and “IMDb rating”. With so much information missing, it may be difficult to draw accurate conclusions about the distributions of these variables, which is something to keep in mind throughout this analysis. Overall, the “movie_data” data set offers many measures of success to explore in order to understand what the ideal conditions for a successful movie are, and how the variables that define a movie’s success interact with one another.
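The cleaning code for this report is not reproduced in this section. As a rough illustration only, the Python sketch below shows one way yearly files with slightly different columns could be stacked into a single table and checked for missing values. The file contents, column names, and the helper functions `stack_years` and `missing_counts` are all hypothetical stand-ins, not the actual files or the method used for “movie_data”.

```python
import csv
import io

# Hypothetical stand-ins for two of the yearly files; the real columns may differ.
YEAR_FILES = {
    2007: "film,rt_critic,genre\nMovie A,85,drama\nMovie B,,comedy\n",
    2008: "film,rt_critic,distributor\nMovie C,40,Studio X\n",
}


def stack_years(files):
    """Stack yearly tables, keeping the union of columns and tagging rows with their year."""
    rows, columns = [], ["year"]
    for year, text in sorted(files.items()):
        reader = csv.DictReader(io.StringIO(text))
        for name in reader.fieldnames:
            if name not in columns:
                columns.append(name)
        for record in reader:
            record["year"] = str(year)
            rows.append(record)
    # Columns that a given year's file lacks are filled with empty strings.
    return columns, [{c: r.get(c, "") for c in columns} for r in rows]


def missing_counts(columns, rows):
    """Count empty cells per column."""
    return {c: sum(1 for r in rows if r[c] == "") for c in columns}


columns, movie_data = stack_years(YEAR_FILES)
print(columns)
print(missing_counts(columns, movie_data))
```

Keeping the union of columns, rather than only the shared ones, means no variable is silently dropped; the cost is extra missing values in years where a column was not recorded.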

Explorations: What variables contribute to a movie’s overall success rate? How do these variables interact with each other, and what characteristics ultimately lead to a movie being successful among various audiences?

Variable 1: Ratings

Within this data set, the 3 main movie rating measures are the Rotten Tomatoes score (audience and critic), Metacritic score (audience and critic), and IMDb rating.

Figure 1: A visualization of the change in audience and critic ratings of Hollywood movies from 2007-2022.

Figure 1 visualizes how the average of Rotten Tomatoes and Metacritic scores has changed over the years, separated by audience and critic rating groups. Overall, it appears that these ratings as a whole have slightly increased since 2007. Many factors could have led to this gradual increase: perhaps the quality of movies has improved over time, or reviewers have become more lenient in their scores. However, there is not enough information to give a concrete explanation. Additionally, the audience rating group consistently gives higher ratings than the critic rating group, possibly suggesting that audiences are less harsh when reviewing films. This analysis helps us understand how much the behaviors of the audience and critic rating groups differ, and shows the overall pattern of both groups’ ratings over the years.

Figure 2: The distribution of the average of Rotten Tomatoes critic ratings and Metacritic critic ratings for movies that have won at least one Oscar and movies that have not won any Oscars.

Figure 2 makes use of two measures of movie success that are determined solely by movie critics: Oscar wins and average critic movie ratings. As can be seen in the box plot, Hollywood movies that have won at least one Oscar award have a higher average of Rotten Tomatoes and Metacritic critic ratings than those that have not won any Oscars. This correlation suggests a similar pattern between critics’ movie assessment and award recognition, in that movies that are praised enough by critics to win an Oscar are also favored highly among Rotten Tomatoes and Metacritic critics.
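A comparison like the one in Figure 2 amounts to splitting movies by Oscar status and summarizing each group. The Python sketch below is an illustration under assumptions: the field names `oscar_wins` and `avg_critic`, the helper `summarize_by_oscar_status`, and the example records are invented for demonstration and are not taken from “movie_data”.

```python
from statistics import mean, median


def summarize_by_oscar_status(movies):
    """Summarize average critic rating for Oscar winners (True) and non-winners (False)."""
    groups = {True: [], False: []}
    for movie in movies:
        rating = movie["avg_critic"]
        if rating is not None:  # skip movies with a missing rating
            groups[movie["oscar_wins"] > 0].append(rating)
    return {
        won: {"n": len(values), "mean": mean(values), "median": median(values)}
        for won, values in groups.items()
        if values
    }


# Made-up records, for illustration only.
example_movies = [
    {"oscar_wins": 1, "avg_critic": 90},
    {"oscar_wins": 0, "avg_critic": 50},
    {"oscar_wins": 0, "avg_critic": 70},
    {"oscar_wins": 2, "avg_critic": None},
]
print(summarize_by_oscar_status(example_movies))
```

Reporting the group size alongside the mean and median matters here, since the Oscar-winning group is likely to be much smaller than the non-winning group.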

Figure 3: A visualization of the relationship between Rotten Tomatoes critic scores and opening weekend revenue.

A movie’s Rotten Tomatoes critic rating is typically available before the movie hits theaters, as critics sometimes have exclusive pre-release screenings. Thus, I was interested in exploring the extent to which this rating is associated with the movie’s opening weekend revenue. Figure 3 shows, however, that the relationship between these two variables is very weak. There is only a slight positive association: as a movie’s Rotten Tomatoes critic rating increases, its opening weekend earnings increase marginally. Because this association is so weak, the Rotten Tomatoes critic score does not appear to have the direct impact on opening weekend revenue that I had initially expected.
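The strength of an association like this one is commonly summarized with the Pearson correlation coefficient, which ranges from -1 to 1, with values near 0 indicating a weak linear relationship. The Python sketch below computes it from scratch; the function name `pearson_r` and the numbers are made up for illustration and are not values from the data set, and the report itself does not state which statistic underlies Figure 3.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / sqrt(var_x * var_y)


# Made-up values: critic scores (0-100) and opening weekend revenue (millions of USD).
critic_scores = [20, 45, 60, 75, 90]
opening_weekend_musd = [30.0, 12.0, 55.0, 8.0, 41.0]
r = pearson_r(critic_scores, opening_weekend_musd)
print(round(r, 2))
```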

Figure 4: The average IMDb, Metacritic, and Rotten Tomatoes critic ratings for each unique script type combination of Hollywood movies.

Figure 4 visualizes the average IMDb, Metacritic, and Rotten Tomatoes critic ratings for each of the unique script type combinations of Hollywood movies from 2007-2022. One important point to note is that the average IMDb rating is only available for 5 out of the 16 script types, revealing a great amount of missingness within this variable and making it difficult to reach a conclusion about the relationship between script type and average IMDb rating. For the other two rating variables, the script type with the highest Metacritic and Rotten Tomatoes critic ratings is “documentary”, suggesting that these critics favor this script type over the others.

Figure 5: The distribution of the deviance of audience vs. critic movie ratings, by genre.

In this data set, “audience vs. critics deviance” refers to the difference between a movie’s average of Rotten Tomatoes and Metacritic critic ratings and its average of Rotten Tomatoes and Metacritic audience ratings. A negative deviance means that the audience rating was higher than the critic rating. In Figure 5, most of the bars fall below zero, meaning that for a majority of genres, audience rating groups gave higher ratings than critic rating groups. This is consistent with Figure 1, which showed that audience rating groups tend to give higher ratings than critic rating groups. Broken down by genre, the graph shows that the genre with the smallest absolute deviance is “biography”, suggesting that audience and critic rating groups rated movies of this genre most similarly. On the other hand, the genre with the largest disparity in ratings is “sci-fi”, suggesting that audiences and critics differ in opinion the most for this genre.
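The definition above can be written out as a small calculation. The Python sketch below is an interpretation, not the data set’s own formula: it assumes all four scores are on the same 0-100 scale, that missing scores are simply skipped when averaging, and that the sign convention is critic average minus audience average (so that a negative value means audiences rated higher). The function names are hypothetical.

```python
def mean_of_available(*scores):
    """Average the scores that are present; return None if all are missing."""
    values = [s for s in scores if s is not None]
    return sum(values) / len(values) if values else None


def audience_vs_critics_deviance(rt_critic, mc_critic, rt_audience, mc_audience):
    """Critic average minus audience average; negative means audiences rated higher."""
    critic = mean_of_available(rt_critic, mc_critic)
    audience = mean_of_available(rt_audience, mc_audience)
    if critic is None or audience is None:
        return None
    return critic - audience


# Made-up scores: critics average 55, audiences average 75.
print(audience_vs_critics_deviance(60, 50, 80, 70))
```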

Figure 6: The distribution of the deviance of Rotten Tomatoes vs. Metacritic audience group movie ratings, by genre.

In Figure 6, the deviances between Rotten Tomatoes audience ratings and Metacritic audience ratings are explored by genre. A positive deviance means that the average Rotten Tomatoes audience score was higher than the Metacritic audience score. All of the bars in this graph are above zero except for the fantasy genre, which suggests that Rotten Tomatoes audience users gave higher ratings than the Metacritic audience rating group in all genres except fantasy. Moreover, Figure 6 interestingly displays the exact opposite of the lowest and highest deviances found in Figure 5. This time, the genre with the smallest absolute deviance is “sci-fi”, suggesting that Rotten Tomatoes and Metacritic audiences rated movies in this category most similarly, and the genre in which the two audiences differed the most is “biography”. This repetition of extreme deviances for the “sci-fi” and “biography” categories in both Figure 5 and Figure 6 could suggest that these are the two genres on which opinion varies the most.

Figure 7: A visualization of the relationship between audience and critic ratings with domestic gross for each movie genre category.

Figure 7 explores the relationship between each rating group’s ratings (audience and critic) and domestic gross, as well as how the genre variable plays into this relationship. Overall, there seems to be a positive association between a movie’s rating and its domestic gross earnings: as the rating of a movie increases, its domestic gross revenue also tends to increase. The trend line is steeper for audience ratings than for critic ratings, suggesting that domestic gross rises faster with audience ratings than with critic ratings. Across genres, the “action” genre has the steepest trend line in both plots, and it is again steeper for audience ratings. This suggests that for action movies specifically, an increase in rating is associated with a larger increase in domestic gross than for other genres, and that this increase is larger for audience ratings than for critic ratings.

Figure 8: A visualization of the movie distributors with the ten highest average critic and average audience ratings.

Figure 8 displays the movie distributors with the highest average Rotten Tomatoes and Metacritic ratings, for both audience and critic rating groups. The movie distributing company with the highest average critic rating is “A24”, and the highest average audience rating was received by “Atlas Distribution Company”. This could suggest that movies released by A24 were favored the most among critics, and movies released by Atlas Distribution Company were favored the most among audiences.

Figure 9: A comparison of the relationships between critic and audience ratings and the percent of a movie’s revenue that is earned from audiences abroad.

In Figure 9, the relationship between the audience and critic ratings and the percent of a movie’s gross that is earned abroad is explored. There is virtually no correlation between critic ratings and the percent earned abroad, and only the slightest positive correlation for audience ratings. This suggests that critic ratings have practically no link with the share of a movie’s gross earned abroad, while audience ratings may have a slight connection to it, though not to a meaningful degree.

Variable 2: Opening Weekend Revenue

A movie’s opening weekend revenue refers to the total box office earnings that the film earned during its first weekend of release in theaters.

Figure 10: A visualization of the change in yearly average opening weekend revenue for Hollywood movies from 2007-2022.

Figure 10 visualizes the change in the mean opening weekend earnings (in millions) for Hollywood movies from 2007-2022. As the graph shows, there are two distinct low points corresponding to the years 2008 and 2020, and these drops can plausibly be explained by the economic state of the country during those years. In 2008, the country experienced the Great Recession, which affected the film industry. This economic crisis led to a decline in consumer spending and movie production, possibly leading to the drop in mean opening weekend earnings that we see in the graph for this year. In 2020, we see a far more drastic drop in mean opening weekend revenue, as the COVID-19 pandemic led to nationwide shutdowns of and capacity limits on movie theaters. Under these conditions, there was a dramatic decline in movie theater ticket sales and thus a dramatic drop in the mean opening weekend revenue of movies released during the pandemic, as shown in the graph. These findings are something to keep in mind throughout this variable analysis, as opening weekend revenue is highly sensitive to crises such as the 2008 Great Recession and the 2020 COVID-19 pandemic.

Figure 11: The relationship between a movie’s earnings during the first weekend of its release and its overall budget recovery earnings, compared across genres and script types.

Figure 11 explores the relationship between a movie’s opening weekend revenue and how much it earns to recover its production cost (budget recovery), categorized by script type and genre. Overall, there is a clear, strong, positive correlation between opening weekend revenue and budget recovery, suggesting that as the amount of money a movie earns during its first weekend in theaters increases, the amount it earns toward recovering its budget also increases. However, this correlation varies between genres and script types. Among genres, the ‘adventure’ category has the highest correlation, which may suggest that adventure movies recover more of their budget as opening weekend revenue increases than movies of other genres do. Among script types, the ‘remake’ script type has the highest correlation, similarly suggesting that remakes recover more of their budget as their opening weekend earnings increase.

Figure 12: The top 5 movie genres and script types that earned the most revenue during their opening weekends.

Figure 12 displays that the genre combination that earned the greatest average revenue during its opening weekend of release is sci-fi & fantasy, and the script type combination that earned the greatest average revenue during its opening weekend of release is sequel & adaptation. This suggests that the movies categorized as a sci-fi fantasy genre hybrid earned more during the first weekend of their release than other genre combinations, and movies categorized as a sequel adaptation script type hybrid also earned that title.

Figure 13: The distribution of opening weekend revenue for movies that have won at least one Oscar award and movies that have won 0 Oscars.

Figure 13 shows that Hollywood movies that have won at least one Oscar award have an average opening weekend revenue that is actually lower than that of movies that have not won any Oscars. This could suggest that a strong opening weekend is not positively associated with winning an Oscar. In other words, having a high opening weekend revenue may not increase a movie’s chance of winning an Oscar.

Figure 14: The relationship between a Hollywood movie’s opening weekend revenue and both its domestic and foreign gross earnings.

Figure 14 displays very strong, positive correlations for both associations of domestic gross by opening weekend revenue and foreign gross by opening weekend revenue. This suggests that a Hollywood movie’s performance during its opening weekend of release has a direct positive association with its overall domestic and foreign grosses. That is, as opening weekend earnings success increases, so will domestic and foreign gross successes. Additionally, the correlation between opening weekend revenue and domestic gross seems to be slightly steeper than the correlation between opening weekend revenue and foreign gross, suggesting that opening weekend revenue performance has a slightly greater impact on its domestic gross performance than it does on its foreign gross performance.

Figure 15: The top 10 most successful movie distribution companies in terms of average opening weekend revenue success.

In examining the top distribution companies based on opening weekend performance, Figure 15 displays that the “Walt Disney Studios” movie distribution company has earned a mean opening weekend revenue that is significantly greater than other movie distributors. This suggests that movies that are released by this company have earned revenue during their opening weekends of being in theaters at a rate significantly greater than movies released by other companies. However, as mentioned in the “Data Overview and Quality” section, the movie distributor company variable contains a great amount of missingness, so this graph may not be an accurate representation of the distributors and their greatest opening weekend successes for all of the movies in the data set.

Figure 16: A visualization of the correlation between opening weekend revenue and Rotten Tomatoes audience scores.

In Figure 3, the relationship between the Rotten Tomatoes critic rating and a movie’s opening weekend success was explored, as Rotten Tomatoes critic ratings are usually revealed before the release of a movie. Figure 16 follows up on this analysis by exploring how the opening weekend success of a movie correlates with its Rotten Tomatoes audience ratings, as audience ratings are typically made after a movie hits theaters. While there was no clearly discernible relationship between the two variables in Figure 3, Figure 16 shows a distinct, positive relationship between opening weekend revenue and Rotten Tomatoes audience score, suggesting that the more successful a movie’s opening weekend is, the higher its Rotten Tomatoes audience score tends to be. In short, opening weekend revenue is more clearly related to the Rotten Tomatoes audience score than to the Rotten Tomatoes critic score.

Variable 3: Domestic, Foreign, & Worldwide Gross

A Hollywood movie’s domestic gross refers to the total box office revenue earned from US audiences, foreign gross is the total box office revenue earned from audiences outside of the US, and worldwide gross is the sum of domestic and foreign gross, representing the overall box office revenue generated by a Hollywood movie globally.
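These definitions translate directly into arithmetic. The Python sketch below shows worldwide gross as the sum of the two components, along with one plausible way to compute the share of gross earned abroad (foreign gross divided by worldwide gross). The function names and example figures are hypothetical, and the share formula is an assumption about how the data set’s percentage is derived.

```python
def worldwide_gross(domestic, foreign):
    """Worldwide gross is the sum of domestic and foreign gross."""
    return domestic + foreign


def percent_earned_abroad(domestic, foreign):
    """Share of worldwide gross earned outside the US, as a percentage (assumed formula)."""
    total = worldwide_gross(domestic, foreign)
    if total == 0:
        return None  # undefined when a movie has no recorded gross
    return 100 * foreign / total


# Made-up figures in millions of USD: 100 domestic, 300 foreign.
print(worldwide_gross(100.0, 300.0))
print(percent_earned_abroad(100.0, 300.0))
```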

Figure 17: A visualization of the change in yearly domestic gross for Hollywood movies from 2007-2022.

Figure 17 visualizes the change in the yearly average domestic gross (in millions) for Hollywood movies from 2007-2022. Just as in Figure 10, there are significant drops for the years 2008 and 2020, also due to the economy of the country during those years. With the 2008 Great Recession, declines in consumer spending due to the economic downturn directly impacted the total box office revenue of movies. With the 2020 COVID-19 pandemic, quarantining and the closing of movie theaters also led to declines in consumer spending and a direct decline in gross domestic revenue for movies. Like the opening weekend revenue variable, the domestic gross variable is heavily impacted by economic crises such as the 2008 Great Recession and the 2020 COVID-19 pandemic.

Figure 18: A visualization of the relationship between a Hollywood movie’s US box office revenue and revenue earned abroad.

Figure 18 displays a direct and strong positive correlation between the domestic gross earnings and foreign gross earnings of Hollywood movies. In other words, as the domestic gross earnings of a movie increase, its foreign gross earnings also increase. This suggests that US and foreign audiences have similar preferences in movie popularity.

Figure 19: The distributions of domestic and foreign gross for each genre of Hollywood movies.

Figure 19 explores another comparison of movie preferences between domestic and foreign audiences, this time by comparing gross performance across movie genres. Ranking genres by highest average gross revenue for each audience, the “sci-fi” category has the best domestic performance, while the “action” and “adventure” categories are tied for the best foreign performance. This suggests that there is a difference in genre popularity between the two audiences, in that US audiences have a high preference for sci-fi movies, while foreign audiences have a high preference for action and adventure movies. A sci-fi movie may perform better in US theaters than in foreign theaters, while action and adventure movies may perform better in foreign theaters.

Figure 20: The distributions of domestic and foreign gross for each script type of Hollywood movies.

Similar to Figure 19, Figure 20 compares the movie preference behavior between domestic and foreign audiences, this time by script type popularity. According to the distributions, the “sequel, adaptation” script type has the best performance for both domestic and foreign audiences. Unlike Figure 19, which revealed a disparity between the domestic and foreign audiences, Figure 20 reveals a way in which they are similar as they both have a stronger preference for “sequel, adaptation” movies over other script types. This category performs equally successfully among both domestic and foreign audiences by bringing in the most average gross revenues.

Figure 21: A visualization of the top 5 movie distributors with the highest domestic and foreign gross earnings.

As a final comparison of movie preference behavior between domestic and foreign audiences, Figure 21 explores the movie distributors with the top 5 highest average domestic and foreign gross revenues. For both US and foreign audiences, the movie distributor with the most successful gross performance is Walt Disney Studios. This reveals a similarity between domestic and foreign audiences in that movies distributed by Walt Disney Studios are more popular (generate more gross revenue) than movies released by other distributors. However, as in Figure 15, there is a lot of missingness in the distributor variable, so this may not be accurate, and there could be another company with a higher mean gross than Walt Disney Studios.

Figure 22: A visualization of the relationship between a Hollywood movie’s worldwide gross revenue and the percent of its production budget that was recovered.

In Figure 22, there is a clear positive relationship between a movie’s worldwide gross earnings and the percentage of its budget that is recovered. This suggests that as the box office revenue of a movie increases, the amount that it earns to recover its budget following its production and release into theaters also increases.

Figure 23: The top 10 genres with a higher proportion of revenue from foreign audiences than domestic audiences.

In examining the top genres based on popularity proportions between domestic and foreign audiences, Figure 23 reveals that the “sci-fi” genre has the greatest percent of its gross earned abroad. This suggests that this category is more popular in foreign theaters than domestic theaters to a degree greater than other genres.

Figure 24: The top 10 script types that have a higher proportion of revenue from foreign audiences than domestic audiences.

In examining the top script types based on their domestic vs. foreign gross revenue makeups, Figure 24 reveals that the “based on a true story, remake” and “sequel, adaptation” script types are tied for having the highest proportion of their gross being contributed from foreign audiences. This suggests that movies with these script type categories earn more abroad than domestically compared to the other script types.

Variable 4: Budget & Budget Recovery

A movie’s budget refers to the total amount of money allocated for the production of the film, including visual effects, marketing, actor salaries, set costs, etc. Budget recovery refers to the total amount of money that the film was able to earn toward recouping these production expenses.
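Elsewhere in this report, budget recovery is described as a percentage of the production budget. One common way to express that, shown in the Python sketch below, is worldwide gross divided by budget, times 100. This formula is an assumption made for illustration; the report does not spell out how the data set computes the variable, and the function name and figures are hypothetical.

```python
def budget_recovered_pct(budget, worldwide_gross):
    """Worldwide gross as a percentage of the production budget (assumed definition)."""
    if not budget:  # covers a missing (None) or zero budget
        return None
    return 100 * worldwide_gross / budget


# Made-up figures in millions of USD: a 50 budget and a 150 worldwide gross.
print(budget_recovered_pct(50.0, 150.0))
```

Under this definition, a value above 100 means the movie grossed more than its budget, and a value below 100 means it did not earn its budget back at the box office.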

Figure 25: A visualization of the change in average production budgets for Hollywood movies from 2007-2022.

Figure 25 follows the same patterns as Figure 10 and Figure 17, showing that the variable of movie budget is also highly impacted by economic crises. In this graph, there are also two distinct low points corresponding to the years 2008 and 2020. With the 2008 Great Recession, financial challenges could have resulted in cost-cutting measures and a more stringent approach to budgeting for movie distributors, leading to a lower average movie budget for that year. With the 2020 COVID-19 pandemic and quarantine, film studios may have altered their production strategies of their movies by delaying the start of filmmaking, leading to an overall decline in film production and thus a decline in mean budgets for that year. From these three similar variable findings, there seems to be a common trend that a movie’s success is greatly impacted by the economy.

Figure 26: The correlation between a Hollywood movie’s production budget and the variables of opening weekend revenue and worldwide gross revenue.

In Figure 26, there is a clear positive association between a Hollywood movie’s budget and its earnings, both during its opening weekend and worldwide overall. This suggests that, on average, movies with higher production budgets tend to achieve greater box office revenue. Movie budget is therefore closely related to opening weekend revenue and worldwide gross: as a movie’s budget increases, its opening weekend revenue and worldwide gross revenue also tend to increase.

Figure 27: The distribution of movie budgets for each genre category.

Figure 27 visualizes the distribution of movie production budgets for each of the genre categories, with the fantasy genre having the highest average budget. This could be due to the fact that the production of fantasy movies usually involves elaborate visual effects, intricate makeup/costumes, computer-generated imagery (CGI), and other advanced technologies to create mythical worlds and landscapes, thus requiring substantial financial investment in technology, skilled artists, and post-production processes that contribute to an overall high average budget.

Figure 28: The distribution of movie budgets for each script type category.

Figure 28 shows the distributions of movie budgets for each script type category of Hollywood movies, with the “sequel, adaptation” script type having the highest average movie budget. So far, this script type has come out on top in many of the analyses in this EDA. For example, Figure 20 revealed that “sequel, adaptation” has the greatest average domestic and foreign gross, and Figure 12 revealed that this same category earned the greatest average revenue during its opening weekend. Given our previous findings and the established correlations between budget and the opening weekend revenue and gross variables, these findings make sense. In Figure 26, it was concluded that there exists a direct, positive association between a Hollywood movie’s budget and its earnings, both during opening weekend and worldwide (domestic + foreign). Thus, if the “sequel, adaptation” script type has the highest average movie budget, we would also expect it to have the greatest domestic and foreign gross and opening weekend revenue, and that is exactly what has been found thus far.

Figure 29: The movie distribution companies with the ten highest average production budgets.

Figure 29 reveals the top 10 movie distribution companies that have the highest mean movie budgets, with Walt Disney Studios in the #1 spot. This again follows the correlations and findings that have been established in the EDA thus far. For example, Figure 15 revealed that Walt Disney Studios has the highest average opening weekend revenue, and Figure 21 displayed Walt Disney Studios as the movie distributor with the most successful gross performance. Figure 26 ties these findings together by concluding that there exists a positive association between a Hollywood movie’s budget and its gross earnings, which helps explain why Walt Disney Studios, as the distributor with the highest budget, is also the distributor with the highest opening weekend, domestic, and foreign revenues. Again, it is important to keep in mind that there is a lot of missingness in the distributor variable, so there could be another company with a higher mean budget than Walt Disney Studios. However, the confirmation of previous findings and correlations still holds.

Figure 30: The distribution of budget for movies that have won at least one Oscar award and movies that have won 0 Oscars.

Similar to Figure 13, Figure 30 shows that Hollywood movies that have won at least one Oscar award have an average production budget that is actually less than movies with no Oscar wins. This could suggest that having a high production budget does not relate to or increase the chances of a movie winning an Oscar and that having a high production budget may not be a factor taken into account when voting for Oscars.

Figure 31: The associations of movie budget with 3 movie rating measures: the average of Rotten Tomatoes and Metacritic critic ratings, the average of Rotten Tomatoes and Metacritic audience ratings, and IMDb ratings.

Figure 31 seeks to explore how a movie’s production budget is correlated with three rating measures: the average of Rotten Tomatoes and Metacritic critic scores, the average of Rotten Tomatoes and Metacritic audience scores, and IMDb ratings. For all three graphs, there seem to be positive correlations, but they are very weak as the data points are very spread out from each other. This could suggest that as movie budget increases, ratings increase, but the relationship is not very strong and movie budget is not a direct determinant of rating success.

Variable 5: Oscar Wins

According to Britannica, the Oscars, also known as the Academy Awards, is an annual awards ceremony presented by the Academy of Motion Picture Arts and Sciences. Films are presented awards in 24 different categories ranging from costume design to original song.

Figure 32: The top 10 films with the greatest number of Oscar awards.

As shown in Figure 32, the film that has earned the greatest number of Oscar awards among the top movies released from 2007-2022 is “Everything Everywhere All at Once”, with a whopping 7 Oscar awards. However, it is important to note that the Oscar variables contain a great number of missing values, and thus this may not be accurate; there could be another film with more Oscar wins. Still, within the data that is present, “Everything Everywhere All at Once” has won more Oscars than any other film.

Figure 33: The top 5 movies with the most Oscar wins, by genre and script type.

Figure 33 displays that the genre combination with the most Oscar wins is “biography, history”, and the script type with the most Oscar wins is “original screenplay”. This suggests that the movies categorized as “biography, history” or “original screenplay” are more successful among Oscar voters.

Figure 34: The distribution of worldwide gross revenue for movies that have won at least one Oscar award and movies that have won 0 Oscars.

Figure 34 shows that Hollywood movies that have won at least one Oscar award have an average worldwide gross earning that is greater than movies that have not won any Oscars. This could suggest a link between these two variables, in that movies that have won an Oscar also have a better worldwide box office revenue performance than movies that have not won any Oscars.

Figure 35: The top 10 Oscar categories with the highest average critic and audience ratings

Figure 35 is a table consisting of the Oscar award category combinations for movies with the highest average critic and audience ratings. It reveals that the Oscar award for “Best Supporting Actress (Patricia Arquette)” has the highest mean critic rating of 99. This corresponds to a single movie in the data set, “Boyhood”. The Oscar awards of “Best Picture, Best Director, Best Original Screenplay, and Best International Feature Film” have the highest mean audience rating of 91, and these awards were won by the movie “Parasite”.

Figure 36: A visualization of average opening weekend revenues for each possible number of Oscar awards that have been won by Hollywood movies.

Figure 36 displays average opening weekend revenues among the possible numbers of Oscar awards won by Hollywood movies over the years. The greatest average opening weekend was earned by movies with 6 Oscar awards. This corresponds to a single movie, “Mad Max: Fury Road”, and its earnings are significantly higher than the averages for movies with 4 and 7 Oscar wins.

Figure 37: A visualization of average domestic and foreign revenues for each possible number of Oscar awards that have been won by Hollywood movies.

Figure 37 displays the average domestic gross revenues and average foreign gross revenues among the possible numbers of Oscar awards won by Hollywood movies. Movies that have won 1 Oscar have both the highest average domestic and highest average foreign gross. Moreover, the two plots display similar patterns. This is consistent with the previous finding in Figure 18 of the correlation that exists between domestic gross and foreign gross.

Variable 6: Seasonal Release Date

These analyses seek to explore how the five main variables above vary/are impacted by the season a movie is released in, and what seasonal release date trends may exist in influencing a movie’s success rate.

Figure 38: A comparison of the mean ratings of movies based on the season they were released in, between the critic and audience rating groups.

Figure 38 shows a comparison between the average ratings for each season between the critic and audience rating groups. There appears to be a similar pattern in both groups’ seasonal average ratings, with the highest ratings given to movies released in the Fall and the lowest ratings given to movies released in the Winter. This reveals a similarity in the seasonal patterns of movie ratings for the two rating groups. However, the taller bars in the plot on the right depict a disparity between the two groups’ rating patterns, in that the audience rating group gives out higher ratings than the critic rating group, as revealed in Figure 1. Figure 38 thus visualizes a way in which the rating patterns for these two groups are similar and confirms a previous finding of a way in which their patterns differ. An overall conclusion can be made that movies released in the Fall have the highest ratings, while movies released in the Winter have the lowest ratings.

Figure 39: A comparison of the average opening weekend revenues in millions of dollars for movies based on what season they were released in.

In Figure 39, it is clear that movies with the highest average opening weekend revenue were released in the Spring. This could suggest that movies that are released in the Spring are more successful in terms of generating more earnings during their first weekend in theaters than movies released in other seasons.

Figure 40: The average total revenue generated by films from all sources globally by the season the film was released in.

Figure 40 shows that movies released during the Summer months have the highest average worldwide gross. This could be due to the fact that in many countries around the world, kids are on summer vacation during these months, and thus families are more likely to go to the movies and contribute to increased ticket sales.

Figure 41: A comparison of the average production budget of movies based on what season they were released in.

Figure 41 shows that movies released in the Spring have the highest average movie budgets. This directly aligns with previous findings in the EDA. In Figure 26, it was concluded that there exists a positive association between a Hollywood movie’s budget and its opening weekend earnings. Therefore, since Figure 39 revealed that the season with the highest average opening weekend revenue was Spring, the season with the highest average movie budgets should also be Spring, and that is what we see in this plot. This supports our finding of the positive correlation that exists between a movie’s budget and opening weekend revenue.

Figure 42: The correlations of movie budget and opening weekend earnings for each movie release season

In Figure 26, it was concluded that there exists a clear positive association between a Hollywood movie’s budget and its opening weekend revenue. Figure 42 again explores this relationship, this time by season. According to the graph, the season with the steepest trend line is “Spring”. This suggests that for movies released in the Spring, opening weekend revenue increases at a higher rate as production budget increases than it does for movies released in other seasons. This steep slope for Spring in Figure 42 makes sense, as it was concluded in Figure 41 that movies released in the Spring have the highest average movie budgets, and Figure 39 displayed that movies released in the Spring have the highest average opening weekend revenue.
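One way to compare how steep the budget-to-opening-weekend relationship is in each season is to fit a separate linear regression per season and compare the slope coefficients. The sketch below does this in base R on made-up numbers; the column names and values are hypothetical and only two seasons are shown for brevity.

```r
# Toy data standing in for movie_data; names and values are hypothetical.
toy <- data.frame(
  season                   = rep(c("Spring", "Fall"), each = 4),
  budget_millions          = c(10, 50, 100, 200, 10, 50, 100, 200),
  opening_weekend_millions = c(5, 30, 65, 130, 4, 15, 28, 55)
)

# Fit opening weekend ~ budget separately within each season and keep the slope,
# i.e. the expected change in opening weekend revenue per extra $1M of budget.
slopes <- sapply(split(toy, toy$season), function(d) {
  fit <- lm(opening_weekend_millions ~ budget_millions, data = d)
  unname(coef(fit)["budget_millions"])
})

print(round(slopes, 3))
```

Note that a slope measures the rate of change, while a correlation coefficient measures how tightly the points follow the line, so the two can rank seasons differently.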

Figure 43: The distribution of Oscar wins based on what season the movie was released in.

In Figure 43, movies that were released in the Fall season won significantly more Oscars than movies released in other seasons. This may be because the Fall season is close to when Oscar voting starts, so these films are more salient and relevant among the voters, while there is still enough time before the start of voting for the films to gain popularity and traction. From this, we can conclude that when a film’s success is defined solely by the number of Oscar wins, releasing the film during the Fall season is associated with a greater chance of being successful.

Figure 44: The distribution of genre categories of the movies that were released in each season.

Figure 44 visualizes the genre distribution of the movies released in each season. According to the graph, the “action” genre made up a majority of the movies released in the Spring, Summer, and Winter, but for movies with a release date in the Fall, a majority of them were of the “crime” genre.

Figure 45: The distribution of script type categories of the movies that were released in each season.

Figure 45 displays the script type distribution of the movies released in each season. As can be seen in the graph, the “original screenplay” category makes up the majority of movies across all seasons. This could suggest that this particular script type is the most popular/common for films.

Conclusion

After this extensive analysis of “movie_data”, numerous chief findings, key takeaways, and common patterns were brought to light. Most notably, it seems that certain variables such as opening weekend success, gross revenue, and budgets are highly impacted by economic turmoil such as the recession and the pandemic; the “sequel, adaptation” script type hybrid and the genre categories of sci-fi, action, and adventure seem to be very popular; movies released in the Fall have a higher chance of winning an Oscar; and the overall most successful season of movie release is the Fall. Additionally, numerous positive correlations seem to be present within the data set: domestic gross by critic and audience movie ratings, budget recovery by opening weekend revenue, domestic and foreign gross by opening weekend revenue, Rotten Tomatoes audience scores by opening weekend revenue, foreign gross by domestic gross, percent of budget recovered by worldwide gross, opening weekend revenue and worldwide gross by movie budget, and finally, movie ratings by budget. All of these relationships reveal how interrelated the various factors that contribute to a movie’s success rate are. As discovered within numerous explorations in the EDA, such correlations can help anticipate and verify the outcomes of later explorations (such as audience rating groups giving consistently higher ratings than the critics, and the “sequel, adaptation” script type having both the highest mean budget and highest opening weekend revenue, since budget and opening weekend revenue are highly correlated). Moreover, many findings were very insightful, such as the fact that the Rotten Tomatoes critic rating does not have as much influence on opening weekend success as I had initially thought.
On the other hand, many conclusions were to be expected, such as the fantasy genre having the highest mean budget, Oscar-winning movies having a higher average critic rating than movies with 0 Oscars, the budget variable being positively and strongly correlated with the “opening weekend revenue” and “gross” variables, and audience rating groups giving out higher ratings than critic rating groups overall. Overall, this EDA serves to provide a greater understanding of the conditions that render a movie successful. “Success” can be defined in numerous ways, from ratings, to gross revenue, to opening weekend revenue, to budget and budget recovery, to Oscar wins, and among these separate variables exist profound relationships and links to one another that further the understanding of what ultimately renders a Hollywood movie a “hit” or a “flop”. In the future, I think it would be interesting to conduct a similar study on movies from the 1900s, and compare and contrast the factors that contribute to the success rates of movies from that era with the ones explored in this analysis of modern-day movies. For example, Rotten Tomatoes did not come about until the late 1990s, so that would not be a variable that is explored, but the Oscars have been awarded since 1929, so it would be interesting to see how the findings for that variable compare between 20th and 21st century movies. This comparison of the explorations of movies from the 1900s to the explorations found in this EDA can bring to light the changes in films from then to now, contributing to an overall complex understanding of the movie industry.

References

McCandless, D. (2023, October). Hollywood Hits and Flops [2007 - 2023]. Kaggle. https://www.kaggle.com/datasets/sujaykapadnis/hollywood-hits-and-flops-2007-2023

Tikkanen, A. (2023, December 4). Academy Award. Britannica. https://www.britannica.com/art/Academy-Award

Appendix: technical info about the data joining/clean-up process

Overall, the raw form of “Hollywood Hits and Flops” required a very extensive amount of cleaning, both before and after joining. As described in the “Data Overview and Quality” section, “Hollywood Hits and Flops” in its raw form is made up of 16 separate CSV files, each pertaining to a year from 2007-2022. However, a simple joining of all 16 files was not possible due to differences between the data sets. For one, the data sets for 2007-2010 have 33 variables, 2011-2018 have 34 variables, and 2019-2022 have 35 variables. This difference in variable number is due to the fact that the 2007-2010 data sets are missing the variables “financial source, if not The numbers” and “film list here”, and the 2011-2018 data sets are missing the “film list here” variable. To resolve this difference in variable numbers between the files, I decided to simply remove those extra columns in the 2011-2022 files, as there were NA values in every observation for those variables and they would not be relevant to my analysis. The next disparity lay in variable names, such as “genre” vs “genres” and differences in spacing between names. After skimming through these differences, I made all variable names uniform between all the data sets, and was then able to use rbind to join all 16 files into one data set called “movie_data”. However, an even deeper analysis of “movie_data” revealed that even more changes needed to be carried out before I could start my analysis. First, I cleaned up the variable names by changing all variables to snake case form. Next, there were numerous variables of types that were not conducive to analysis. For example, all of the variables with numbers as their observations, such as critic variables and gross revenue, were of character type rather than integer, the release date variable was a character type rather than date, and variables that dealt with percentages were character types with the % sign included rather than integer types.
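The joining steps and the type problems described above can be sketched as follows. This is a minimal illustration in base R on two tiny made-up tables, not the original cleaning script: the table names, column names, and values are hypothetical, and the date strings are assumed to already be in ISO (year-month-day) form, which may not match the raw files.

```r
# Two toy yearly tables standing in for the 16 CSV files (hypothetical contents).
movies_2007 <- data.frame(
  Film = "A", Genre = "drama", `Worldwide Gross` = "1,000,000",
  `Budget Recovered` = "250%", `Release Date` = "2007-06-01",
  check.names = FALSE, stringsAsFactors = FALSE
)
movies_2019 <- data.frame(
  Film = "B", Genres = "action", `Worldwide Gross` = "2,500,000",
  `Budget Recovered` = "80%", `Release Date` = "2019-11-22",
  `Film List Here` = NA,
  check.names = FALSE, stringsAsFactors = FALSE
)

# 1. Drop the all-NA extra column that only the later files carry.
movies_2019[["Film List Here"]] <- NULL

# 2. Make the column names uniform ("Genres" -> "Genre").
names(movies_2019)[names(movies_2019) == "Genres"] <- "Genre"

# 3. Stack the rows; rbind() on data frames needs matching column names.
movie_data <- rbind(movies_2007, movies_2019)

# 4. Snake case: lower-case, then replace runs of other characters with "_".
names(movie_data) <- gsub("[^a-z0-9]+", "_", tolower(names(movie_data)))

# 5. Character -> numeric / date: strip "," and "%" before converting.
movie_data$worldwide_gross  <- as.numeric(gsub(",", "", movie_data$worldwide_gross, fixed = TRUE))
movie_data$budget_recovered <- as.numeric(sub("%", "", movie_data$budget_recovered, fixed = TRUE))
movie_data$release_date     <- as.Date(movie_data$release_date)  # assumes ISO dates

str(movie_data)
```

With 16 real files, steps 1 and 2 would be applied to each file before a single `do.call(rbind, list_of_files)` call, where `list_of_files` is a hypothetical list holding the cleaned yearly tables.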
After identifying these variables, I carried out the respective type conversions. Third, numerous variables had unnecessary missing values. For example, many movies had missing values for their worldwide gross, even though their domestic and foreign gross values were present and worldwide gross is just the sum of domestic and foreign gross. Moreover, many movies’ genres were missing when their primary genres were available. Therefore, I filled in these NAs by setting NAs in worldwide gross to be the sum of domestic and foreign gross, and setting NAs in the genre variable to be whatever value was in the primary genre variable, if present. Next, the Oscar winners variable would display “Oscar winner” for movies that have won at least one Oscar award and blanks for movies with 0 Oscar awards. In order to be able to carry out analyses using this variable, I decided to convert the Oscar winners variable from character to boolean type, and had it display TRUE to denote Oscar wins and FALSE to denote 0 Oscar wins. Another flaw that I observed was that there were incorrect millions conversions. Many variables, such as opening weekend revenue and gross revenue, also have a “millions” version in which their numbers are represented in millions to make working with the large numbers less complicated. However, when skimming the data, I noticed that many of these conversions were incorrect (dividing by 100,000 instead of 1,000,000), which resulted in numbers being unreasonably high and inaccurate. Therefore, I replaced these “millions” columns in their entirety by performing my own math to obtain the correct conversions. Finally, the last step of my post-join cleaning process was to add two new variables: season and oscar count. Using the release date variable and a case_when function, I created the season variable to denote the season of the year (fall, winter, spring, summer) in which each movie was released, in order to perform seasonal analyses.
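The fill-in, recoding, and derived-variable steps described above might look roughly like the sketch below. It uses base R on made-up rows so that it runs without extra packages: nested `ifelse()` calls stand in for the `case_when` function named in the text, the column names are hypothetical, and the month-to-season cutoffs (March to May as spring, and so on) are an assumption, since the exact boundaries are not stated.

```r
# Toy rows standing in for movie_data after type conversion (hypothetical values).
movie_data <- data.frame(
  domestic_gross  = c(100e6, 40e6, 10e6),
  foreign_gross   = c(150e6, 60e6, 5e6),
  worldwide_gross = c(250e6, NA, 15e6),
  genre           = c("action", NA, "drama"),
  primary_genre   = c("action", "comedy", "drama"),
  oscar_winners   = c("Oscar winner", "", ""),
  release_date    = as.Date(c("2015-05-15", "2016-10-07", "2019-01-11")),
  stringsAsFactors = FALSE
)

# Fill missing worldwide gross with domestic + foreign gross.
missing_ww <- is.na(movie_data$worldwide_gross)
movie_data$worldwide_gross[missing_ww] <-
  movie_data$domestic_gross[missing_ww] + movie_data$foreign_gross[missing_ww]

# Fill missing genre from primary genre (stays NA if that is missing too).
missing_genre <- is.na(movie_data$genre)
movie_data$genre[missing_genre] <- movie_data$primary_genre[missing_genre]

# "Oscar winner" / blank -> TRUE / FALSE.
movie_data$oscar_winners <- movie_data$oscar_winners %in% "Oscar winner"

# Recompute the "millions" column from raw dollars (divide by 1,000,000).
movie_data$worldwide_gross_millions <- movie_data$worldwide_gross / 1e6

# Season from release month; the cutoffs below are an assumption.
month_num <- as.integer(format(movie_data$release_date, "%m"))
movie_data$season <- ifelse(month_num %in% 3:5,  "spring",
                     ifelse(month_num %in% 6:8,  "summer",
                     ifelse(month_num %in% 9:11, "fall", "winter")))

print(movie_data[, c("worldwide_gross_millions", "genre", "oscar_winners", "season")])
```

Using `%in%` for the Oscar recode means a missing value is treated as FALSE rather than propagating as NA, which matches the description of blanks denoting 0 Oscar wins.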
Using the oscar detail variable and an ifelse statement, I created the oscar count variable to denote the total number of Oscar awards a movie has won. Finally, after all of these changes, deletions, and insertions were carried out, my “movie_data” data set was ready for analysis.
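As an illustration of the oscar count step, the sketch below counts awards from a text field with `ifelse()`. It assumes that the Oscar detail text lists awards separated by commas and that a missing value means no Oscars were recorded; both are assumptions about the data format, and the example strings are made up.

```r
# Hypothetical examples of the Oscar detail text; the real formatting may differ.
oscar_detail <- c(
  "Best Picture, Best Director, Best Original Screenplay",
  "Best Supporting Actress (Patricia Arquette)",
  NA
)

# Assumption: awards are comma-separated, so the count is the number of pieces.
# A missing detail (no Oscars recorded) is mapped to 0.
oscar_count <- ifelse(
  is.na(oscar_detail),
  0L,
  lengths(strsplit(oscar_detail, ",", fixed = TRUE))
)

print(oscar_count)
```

If an award name itself contained a comma, this approach would overcount, so the result would be worth spot-checking against a few known films.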

Appendix: extra explorations

Figure 46: A visualization of the genres present in each category of movie script type.

Figure 46 displays the distribution of the different genres that make up each script type. “Action” makes up a majority of adaptations, “drama” makes up a majority of films based on a true story, “thriller” makes up a majority of original screenplays, “western” makes up all of remakes, and “action” makes up a majority of sequels. “Action” genre movies seem to be very prevalent.

Figure 47 visualizes how the prevalence of genres has changed over the years based on how many movies were released for each genre. Based on the graph, it seems that the genre category with the most movie releases is “action”.

Figure 48: A visualization of the changes in the number of films released in each year from 2007-2022

Figure 48 displays the changes in the number of movies that were released for each year from 2007-2022. Most notably, there was a dramatic drop in the number of films released from 2019 to 2020, which can be explained by the 2020 COVID-19 pandemic and the fact that the nation was under quarantine.

Figure 49: Top 3 movies with the highest average critic ratings.

The film with the highest average critic rating is “Boyhood”, with a mean critic rating of 99. This is an original screenplay drama released in 2014.

Figure 50: Top 3 movies with the highest average audience ratings.

The film with the highest average audience rating is “God’s Not Dead: We the People”, with a mean audience rating of 100. This is a drama of unknown script type released in 2021.

Figure 51: Top 3 movies with the highest average opening weekend revenue.

The film with the highest average opening weekend revenue is “Avengers: Endgame”, with a mean opening weekend earning of $357 million. This is an action adaptation released in 2019.

Figure 52: Top 3 movies with the highest average domestic gross revenue.

The film with the highest average domestic gross revenue is “Star Wars: Force Awakens”, with a mean domestic gross earning of $937 million. This is a sci-fi, fantasy sequel released in 2015.

Figure 53: Top 3 movies with the highest average foreign gross revenue.

The film with the highest average foreign gross revenue is “Avatar”, with a mean foreign gross earning of $2021 million. This is an action, adventure, and fantasy original screenplay released in 2009.

Figure 54: Top 3 movies with the highest average worldwide gross revenue.

The film with the highest average worldwide gross revenue is “Star Wars: Force Awakens”, with a mean worldwide gross earning of $2068 million. This is a sci-fi, fantasy sequel released in 2015.

Figure 55: Top 3 movies with the highest percentage of their revenue earned by foreign audiences.

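The film with the highest percentage of its gross earned abroad is “Café Society”, with a recorded value of 264.84%. Since a share of total gross cannot exceed 100%, this value likely reflects an error in the source data or a percentage computed against a different base. This is an original screenplay drama released in 2016.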

Figure 56: Top 3 movies with the highest average production budgets.

The film with the highest average budget is “Avengers: Endgame”, with a mean budget of $356 million. This is an action adaptation released in 2019.

Figure 57: Top 3 movies with the highest percentage of their budget recovered.

The film with the highest percentage of its budget recovered is “Paranormal Activity”, with 1289066.67% of its budget recovered. This is a horror, mystery original screenplay drama released in 2009.
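A percentage this large is plausible for a very low-budget film, as a quick worked example shows. The sketch below assumes the percentage of budget recovered is computed as worldwide gross divided by budget, times 100; that formula, the helper name `budget_recovered_pct`, and the dollar figures are all assumptions for illustration, not values from the data set.

```r
# Hypothetical helper: percent of budget recovered = gross / budget * 100.
budget_recovered_pct <- function(worldwide_gross, budget) {
  round(worldwide_gross / budget * 100, 2)
}

# Made-up figures: a $1 million film that grosses $50 million worldwide
# recovers 50 times its budget, i.e. 5000%.
small_film <- budget_recovered_pct(50e6, 1e6)

# Made-up figures: a $200 million film that grosses $600 million recovers 300%.
big_film <- budget_recovered_pct(600e6, 200e6)

print(c(small_film = small_film, big_film = big_film))
```

Because the budget sits in the denominator, the smallest budgets produce the most extreme percentages even when the absolute gross is modest.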

Figure 58: Top 3 movies with the most Oscar awards.

The film with the most Oscar wins is “Everything Everywhere All at Once”, with 7 Oscar awards. This is an adventure, sci-fi, fantasy, comedy of unknown script type released in 2022.